Contents

0.0.1 Abstract

A central task in genomic data analyses for stratified medicine is class discovery which is accomplished through clustering. However, a critical problem with current approaches, is they do not use a reference datset to test the null hypothesis and derive a p value. We chose to reinvent the popular consensus clustering algorithm to include a hypothesis testing framework and by doing so developed Monte Carlo Consensus Clustering (M3C). M3C use a multi-enabled monte carlo simulation to generate a distribution of stability scores from a null dataset with the same gene-gene correlation structure as the real. This allows derivation of A) the relative cluster stability index and B) an empirical p value. A beta distribution is used to cheaply estimate p values below the minimum simulated PAC. Our new framework improves accuracy, permits rejection of the null hypothesis K = 1, removes systematic bias, and provides confidence in the form of a p value for different clustering solutions. Our approach deals with a number of major pitfalls in current clustering algorithms.

0.0.2 Prerequisites

M3C recommended spec:

A relatively new and fast multi-core computer or cluster.

M3C requires:

A matrix or data frame of normalised expression data (e.g. microarray or RNA-seq) where columns equal samples and rows equal features.

The data should be filtered to remove features with no or very low signal, and filtered for variance to reduce dimensionality (unsupervised), or p value from a statistical test (supervised).

M3C also accepts optionally:

Annotation data frame, where every row is a patient/sample, columns refer to meta-data e.g. age, sex. M3C will automatically rearrange your annotation to match the clustering output and add the consensus cluster to it. Note, this only works if the IDs (column names in data) match a column called “ID” in the annotation data frame.

0.0.3 Example workflow

The M3C package contains the GBM cancer microarray dataset for testing. There is an accepted cluster solution of 4. First we load the package which also loads the GBM data.

library(M3C)
library(NMF) # loading for aheatmap plotting function
library(gplots) # loading this for nice colour scale
library(ggsci) # more cool colours

# now we have loaded the mydata and desx objects (with the package automatically)
# mydata is the expression data for GBM
# desx is the annotation for this data

0.0.4 Running M3C

Next, we run the algorithm with 100x monte carlo iterations and 100x inner replications for the real data and reference. The iterations parameter can be increased to 1000 later for a final analysis. Plots from the tool and an .csv file with the numerical outputs are printed into the working directory (printres = T). We will set the seed in this example, incase you wish to repeat our results exactly. We will add the annotation file for more streamlined plotting later on (des = desx).

We always recommend saving the workspace after M3C if running higher numbers of iterations, because the runtimes can be quite long. We have removed hierarchical clustering from this algorithm and recommend against k means, because they performed either with poor accuracy in our simulations or too slow relative to PAM, respectively. M3C uses PAM with Euclidean distance. The reference method that generates the null datasets is best left to the default setting which is ‘reverse-PCA’ that maintains the correlation structure of the data using the principle component loadings. We will set the removeplots option to TRUE in this example to remove plots from the vignette, normally this is FALSE by default.

res <- M3C(mydata, cores=1, iters=100, ref_method = 'reverse-pca', montecarlo = TRUE,
                         printres = TRUE, maxK = 10, showheatmaps = FALSE, repsreal = 100, repsref = 100,
                         printheatmaps = FALSE, seed = 123, des = desx, removeplots = TRUE)

The scores and p values are contained within the res$scores object. So we can see the RCSI reaches a maxima at K = 4 (RSCI=0.33), the monte carlo p value supports this decision (p=0.033). This means we can reject the null hypothesis that K = 1 for this dataset because we have achieved significance versus a dataset with no clusters (but with identical gene-gene correlation structure). For p values that can extend beyond the lower limits imposed by the monte carlo simulation we use estimate parameters from the simulation to generate a beta distribution, the BETA_P in this case study is 0.033. We may of course want to take a look at other significant or near significant clustering solutions and how they relate to our variables of investigation.

res$scores
##    K  PAC_REAL   PAC_REF        RCSI MONTECARLO_P     BETA_P   P_SCORE
## 1  2 0.6734694 0.5773796 -0.15394262   1.31372549 0.69721054 0.1566361
## 2  3 0.4473469 0.5410857  0.19024326   0.37254902 0.19923640 0.7006313
## 3  4 0.3404082 0.4711755  0.32508528   0.07843137 0.03313314 1.4797374
## 4  5 0.3200000 0.4018041  0.22764361   0.17647059 0.07848793 1.1051971
## 5  6 0.3102041 0.3474204  0.11330519   0.47058824 0.22889997 0.6403543
## 6  7 0.2840816 0.3031673  0.06502332   0.62745098 0.33674564 0.4726980
## 7  8 0.2653061 0.2700408  0.01768878   0.90196078 0.46514316 0.3324134
## 8  9 0.2465306 0.2413469 -0.02125070   1.11764706 0.56967334 0.2443741
## 9 10 0.2130612 0.2190531  0.02773443   0.88235294 0.44629390 0.3503790

Next, we will take a look at some of the plots M3C generates.

This is a CDF plot of the GBM data we feed into the algorithm. We are looking for the value of K with the flattest curve and this can be quantified using the PAC metric (Șenbabaoğlu et al., 2014). Note, we have removed delta K from the algorithm because as we and others (Șenbabaoğlu et al., 2014) have recently shown, it performs badly. In the CDF plot we can see the overfitting effect of consensus clustering where as K increases so does the apparent stability of the solution, this we correct for by using a reference.

Figure 1: CDF plot for real data

This figure below is the PAC score, we can see an elbow at K = 4 which is suggestive this is the best K. However, we are not able to quantify how confident we are in this value without comparison versus a null dataset. Indeed, this may just be the result of random noise.

Figure 2: PAC score for real data

We then derive the RCSI which takes into account the reference PAC scores, but not the shape of the distribution. This metric is better than the PAC score for deciding class number, where the maxima is the number of clusters in the data. In this example the RCSI has an optima at K=4. It generally requires fewer iterations than deriving the empirical p values and it inspired by the GAP-statistic.

Figure 3: RCSI

Finally we calculate a p value from the distribution, here we display the p values from the beta distribution, note the monte carlo p values are slightly better, but the method tends to take longer to compute. If none of the p values reach significance over a reasonable range of K (e.g. 10), then we accept the null hypothesis. In the GBM dataset we can see K = 4 reaches signfiicance with an alpha of 0.05 (red dotted line), therefore we can reject the null hypothesis for the GBM dataset.

Figure 4: Empirical p values

Now we are convinced there are 4 clusters within this dataset which are not likely simply to have occurred by chance alone we can turn to examine the output objects that M3C generates. These facilitate clustering and heatmap rendering for publication.

0.0.5 Understanding M3C outputs

The first 3 lines below extract the expression data and the annotation data from the results object after running M3C. If we wanted to extract the data for 5 clusters, we would simply replace in 4 in the below lines to 5 (and so on). We scale the data here row wise according to z-score prior to some light compression for visualisation purposes in the heatmap. Remember to set Colv = NA for heatmap plotting because we have already ordered the data column or samplewise, M3C does that for you.

# get the data out of the results list (by using $ - dollar sign)
data <- res$realdataresults[[4]]$ordered_data # this is the data
annon <- res$realdataresults[[4]]$ordered_annotation # this is the annotation
ccmatrix <- res$realdataresults[[4]]$consensus_matrix # this is the consensus matrix

# normalise and scale the data
data <- t(scale(t(data))) # z-score normalise each row (feature)
data <- apply(data, 2, function(x) ifelse(x > 4, 4, x)) # compress data within range
data <- apply(data, 2, function(x) ifelse(x < -4, -4, x)) # compress data within range

# get some cool colour palettes from the ggsci package and RColourBrewer
ann_colors <- ggsci::pal_startrek("uniform")(4) # star trek palette
ann_colors2 <- ggsci::pal_futurama()(4) # futurama palette
pal <- rev(colorRampPalette(RColorBrewer::brewer.pal(10, "RdBu"))(256))
NMF::aheatmap(data, annCol = annon, scale = 'row', Colv = NA, distfun = 'pearson', 
         color = gplots::bluered(256), annColors = list(class=ann_colors, consensuscluster=ann_colors2))

Figure 5: aheatmap of GBM consensus clusters with tumour classification

The last thing we may want to do for publications is print the consensus matrix for our optimal clustering solution (K = 4), this should be quite crisp reflecting the flat distribution of the CDF. Note, we could have printed all the consensus heatmaps into the current directory using ‘printheatmaps = T’, however, in this case we will show how to extract the matrix, cluster and render it with the aheatmap function from the NMF package. We can see in this heatmap of the consensus matrix the clusters look quite clear supporting our view that there is 4 clusters.

# time to plot the consensus matrix for the optimal cluster decision
ccmatrix <- res$realdataresults[[4]]$consensus_matrix # pull out the consensus matrix from the k = 4 object
pal <- rev(colorRampPalette(RColorBrewer::brewer.pal(9, "Reds"))(256)) # get some nice colours
NMF::aheatmap(ccmatrix, annCol = annon, Colv = NA, Rowv = NA, 
              color = rev(pal), scale = 'none') # plot the heatmap

Figure 6: aheatmap of GBM consensus matrix

So we have now covered the basic use of M3C. Generally, we recommend complexheatmap for rendering of heatmaps as it is more customisable for publications, however, aheatmap (from the NMF package) is also nice and easier to use.

Happy cluster hunting!

We will now turn to an extra function of the package, which can be used for benchmarking clustering tools.

0.0.6 Generating simulated data

We have included a function for generating simulated data as part of the package to test clustering algorithms. This cluster simulator is simple to use. This simulator, using the code below, generates a dataset with 225 samples, 900 features, a radius cut-off for the initial square of 8, a cluster number of 4, a seperation of clusters of 0.75, and a degree of noise added to each co-ordinate of 0.025. After running, a PCA will print of the data so we can visualise the 4 clusters in principle component space.

  res <- clustersim(225, 900, 8, 4, 0.75, 0.025, print = FALSE, seed=123)
  sim <- res$simulated_data # we extract the simulated data from the results list
  mydata <- as.data.frame(sim) # convert to data frame

Figure 7: PCA of a 4 cluster simulated dataset

0.0.7 Conclusions

In this tutorial, we have seen that M3C can provide statistical validation of clustering results as well as using relative scores. Our Monte Carlo clustering approach is computationally expensive, but this is justified by 4 reasons; 1) Increased performance relative to other methods, 2) The ability to reject the null hypothesis, 3) The ability to assess confidence of using a p value instead of just a relative score, 3) The removal of systematic bias inherant in the results. In our studies of real data, we have seen M3C effectively drives class discovery throughout different stratified medicine projects.

0.0.8 References

Monti, Stefano, et al. “Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data.” Machine learning 52.1 (2003): 91-118.

Șenbabaoğlu, Yasin, George Michailidis, and Jun Z. Li. “Critical limitations of consensus clustering in class discovery.” Scientific reports 4 (2014): 6207.